
Large result set retrieval: streaming-only, all split types, query rewrite fix #128

Open

schenksj wants to merge 3 commits into main from feature/large-result-set-retrieval

Conversation

schenksj (Collaborator) commented Mar 13, 2026

Summary

  • Streaming retrieval for all split types: startStreamingRetrieval() now works for both companion (parquet) and regular (tantivy doc store) splits
  • Fused path eliminated: searchAndRetrieveArrowFfi() deprecated and reimplemented as streaming wrapper — single code path reduces bug surface
  • Non-companion streaming: New streaming_doc_retrieval.rs converts tantivy doc store documents to Arrow RecordBatches via doc_async() with bounded memory
  • Query rewrite bug fix: perform_bulk_search() now applies the same companion query rewrites as the regular search() path — fixes IS NOT NULL returning 0 rows and exact_only EqualTo returning nothing
  • Memory safety: ARC_REGISTRY-based session handles, Arrow FFI buffer leak prevention, bounds-checked array access
  • Adaptive I/O: Per-file read strategy based on selectivity to minimize S3 GET requests (companion path)
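The bounded-memory property described above can be pictured with a standard bounded channel: the producer blocks once a fixed number of batches is in flight, so peak memory stays capped regardless of total result size. The sketch below is illustrative only and does not reflect the PR's actual pipeline; the function name `produce_batches` and the capacity of 3 are hypothetical.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

// Hypothetical sketch: a producer emits fixed-size "batches" through a
// bounded channel, so at most `capacity` batches are buffered at once.
// Returns the sizes of the batches the consumer received, in order.
fn produce_batches(total_rows: usize, batch_rows: usize) -> Vec<usize> {
    // Capacity of 3 caps in-flight batches; send() blocks when full.
    let (tx, rx) = sync_channel::<Vec<u64>>(3);

    let producer = thread::spawn(move || {
        let mut sent = 0usize;
        while sent < total_rows {
            let n = batch_rows.min(total_rows - sent);
            let batch: Vec<u64> = (sent as u64..(sent + n) as u64).collect();
            // If the consumer dropped its receiver, stop producing.
            if tx.send(batch).is_err() {
                return;
            }
            sent += n;
        }
        // tx dropped here, which terminates the consumer's iterator.
    });

    // Consumer drains batches as they arrive; memory stays bounded.
    let sizes: Vec<usize> = rx.iter().map(|b| b.len()).collect();
    producer.join().unwrap();
    sizes
}

fn main() {
    println!("{:?}", produce_batches(10, 4)); // batches of 4, 4, 2 rows
}
```

The real pipeline carries Arrow RecordBatches instead of integer vectors, but the back-pressure mechanism is the same idea.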

Changes since initial PR

  • Deleted fused_retrieval.rs — single streaming path for all result sizes
  • Added streaming_doc_retrieval.rs — non-companion split streaming via tantivy doc store
  • Fixed bulk_retrieval.rs — added rewrite_companion_query() for FieldPresence → _phash_* and exact_only → hash rewrites, plus ensure_fast_fields_for_query() for parquet-transcoded fast fields
  • Added StreamingRetrievalSession::new() constructor in streaming_ffi.rs
  • JNI nativeStartStreamingRetrieval now routes companion vs regular splits
  • Updated Java javadoc and error messages to reflect all-split-type support
  • Added 5 regression tests for streaming + field presence, IS NULL, exact_only queries
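The query-rewrite fix listed above can be sketched as a recursive pass over a query tree: field-presence checks are redirected to a per-field presence column, and exact-match literals are replaced with their hashes. This is a toy model, not the real `rewrite_companion_query()` (which operates on tantivy query ASTs); the `Query` enum, the `_phash_` prefix handling, and `hash64` below are illustrative stand-ins.

```rust
// Hypothetical toy AST illustrating the two rewrites described above:
// FieldPresence -> _phash_* column probe, and exact term -> hash lookup.
#[derive(Debug, PartialEq)]
enum Query {
    FieldPresence { field: String },
    ExactTerm { field: String, value: String },
    Bool(Vec<Query>),
}

// Stand-in for the real term hashing (FNV-1a 64-bit here); the actual
// hash used by the companion format may differ.
fn hash64(s: &str) -> u64 {
    s.bytes().fold(14695981039346656037u64, |h, b| {
        (h ^ b as u64).wrapping_mul(1099511628211)
    })
}

fn rewrite(q: Query) -> Query {
    match q {
        // IS NOT NULL: probe the per-field presence-hash column instead of
        // the (absent) original column, so presence queries return rows.
        Query::FieldPresence { field } => Query::ExactTerm {
            field: format!("_phash_{field}"),
            value: "1".to_string(),
        },
        // exact_only fields store hashes, so rewrite the literal to its hash.
        Query::ExactTerm { field, value } => Query::ExactTerm {
            field,
            value: hash64(&value).to_string(),
        },
        // Recurse into compound queries.
        Query::Bool(children) => {
            Query::Bool(children.into_iter().map(rewrite).collect())
        }
    }
}

fn main() {
    let q = rewrite(Query::FieldPresence { field: "name".to_string() });
    println!("{:?}", q);
}
```

Without such a pass, a presence query probes a column that does not exist in the companion layout and matches nothing, which is exactly the "IS NOT NULL returns 0 rows" symptom this PR fixes.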

Files changed

  • native/src/split_searcher/fused_retrieval.rs — DELETED
  • native/src/split_searcher/streaming_doc_retrieval.rs — NEW (non-companion streaming)
  • native/src/split_searcher/bulk_retrieval.rs — query rewrite fix
  • native/src/split_searcher/document_retrieval/doc_retrieval_jni.rs — dual-path routing
  • native/src/parquet_companion/streaming_ffi.rs — new constructor
  • native/src/split_searcher/mod.rs — module registration
  • SplitSearcher.java — updated javadoc, deprecated wrapper
  • LargeResultSetRetrievalTest.java — removed fused tests, added regression tests
  • docs/LARGE_RESULT_SET_DEVELOPER_GUIDE.md — updated architecture diagram

Test plan

  • cargo check passes
  • cargo test --lib streaming_doc_retrieval — 4 unit tests pass
  • mvn test — 1261 tests, 0 failures, 0 errors
  • Streaming IS NOT NULL returns correct row count (regression test)
  • Streaming IS NULL returns 0 when all rows have field (regression test)
  • Streaming exact_only term query finds matching document (regression test)
  • Non-companion split streaming works end-to-end (regression test)

🤖 Generated with Claude Code

…e APIs

Implements three-tier companion-mode bulk retrieval for returning thousands
to millions of rows from parquet-backed splits via Arrow FFI with bounded
memory. Includes DocIdCollector (no-score search), fused single-call path,
streaming session pipeline with adaptive I/O strategy, ARC_REGISTRY-based
session handles for memory safety, and StreamingSession AutoCloseable Java
wrapper with double-close protection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
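The ARC_REGISTRY-based session handles mentioned in this commit can be sketched as a global table mapping opaque `u64` handles to `Arc`'d sessions: the JNI layer hands Java a handle instead of a raw pointer, so a stale or double-closed handle fails lookup rather than dereferencing freed memory. All names below (`Session`, `register`, `lookup`, `close`) are hypothetical; the real registry's shape may differ.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{Arc, Mutex, OnceLock};

// Hypothetical sketch of an ARC_REGISTRY-style handle table.
struct Session {
    name: String,
}

static NEXT_HANDLE: AtomicU64 = AtomicU64::new(1);

fn registry() -> &'static Mutex<HashMap<u64, Arc<Session>>> {
    static REG: OnceLock<Mutex<HashMap<u64, Arc<Session>>>> = OnceLock::new();
    REG.get_or_init(|| Mutex::new(HashMap::new()))
}

// Store the session and hand back an opaque handle (never a raw pointer).
fn register(session: Session) -> u64 {
    let handle = NEXT_HANDLE.fetch_add(1, Ordering::Relaxed);
    registry().lock().unwrap().insert(handle, Arc::new(session));
    handle
}

// A stale handle yields None instead of a use-after-free.
fn lookup(handle: u64) -> Option<Arc<Session>> {
    registry().lock().unwrap().get(&handle).cloned()
}

// Idempotent close: only the call that actually removed the entry returns
// true, mirroring the Java wrapper's double-close protection.
fn close(handle: u64) -> bool {
    registry().lock().unwrap().remove(&handle).is_some()
}

fn main() {
    let h = register(Session { name: "demo".to_string() });
    println!("open: {}", lookup(h).is_some());
    println!("closed: {}", close(h));
}
```

Because the registry holds an `Arc`, an in-flight native call that already cloned the session keeps it alive even if Java closes the handle concurrently; the memory is freed only when the last clone drops.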
claude bot commented Mar 13, 2026

Claude finished @schenksj's task


Code Review Complete

Reviewing PR #128 for:

  1. Correctness bugs (logic errors, off-by-one, null safety)
  2. Security issues (injection, unsafe operations, credential handling)
  3. Merge conflict markers or build-breaking issues
  4. Missing error propagation (especially JNI error swallowing)
  5. API consistency with existing patterns

Todo List

  • Review new Rust modules for memory safety and correctness
  • Check JNI error handling and propagation
  • Validate Java API changes for consistency
  • Look for potential security vulnerabilities
  • Check session management and resource cleanup
  • Review streaming pipeline implementation
  • Validate Arrow FFI buffer management

Summary: Found 4 Issues (1 Critical, 2 Moderate, 1 Minor)

Good Architecture Design

  • Proper use of ARC_REGISTRY for session management (avoids use-after-free)
  • Sound streaming pipeline with bounded memory (~24MB max)
  • Appropriate separation of fused vs streaming paths
  • Good error handling in most areas

🔴 CRITICAL: Session Handle Memory Leak — FALSE ALARM

File: SplitSearcher.java — StreamingSession class

RESOLVED — The StreamingSession class correctly implements AutoCloseable, with proper native cleanup via nativeCloseStreamingSession() in its close() method.


🟡 MODERATE: Integer Overflow Risk

File: bulk_retrieval.rs:50-51

let total: usize = segment_fruits.iter().map(|v| v.len()).sum();
let mut result = Vec::with_capacity(total);

Issue: On very large result sets, total can overflow since each segment could contain millions of documents.

Risk: Panic on allocation or incorrect capacity calculation.

Recommendation:

let total = segment_fruits.iter()
    .map(|v| v.len())
    .try_fold(0usize, |acc, len| acc.checked_add(len))
    .ok_or_else(|| anyhow!("Result set too large: document count overflow"))?;

🟡 MODERATE: Silent Error Swallowing in Stream

File: streaming_ffi.rs:215-217

if tx.send(Ok(normalized)).await.is_err() {
    perf_println!("⏱️ STREAMING: consumer dropped — stopping producer");
    return Ok(()); // ← Silent success despite potential error
}

Issue: Producer treats channel closure as success, potentially masking critical errors.

Risk: Silent data loss if consumer fails early.

Recommendation: Distinguish early failure from normal completion:

if tx.send(Ok(normalized)).await.is_err() {
    if rows_emitted == 0 {
        return Err(anyhow!("Consumer dropped before any data sent - possible error"));
    }
    perf_println!("⏱️ STREAMING: consumer dropped after {} rows", rows_emitted);
    return Ok(());
}

🟢 MINOR: Missing Bounds Check

File: streaming_ffi.rs:195

let batches = read_parquet_batches_for_file(/* ... */);

Issue: No validation that file_idx is within manifest.parquet_files bounds.

Risk: Potential panic on malformed input.

Recommendation: Add bounds check:

if file_idx >= manifest.parquet_files.len() {
    return Err(anyhow!("Invalid file_idx {} >= {}", file_idx, manifest.parquet_files.len()));
}
let file_entry = &manifest.parquet_files[file_idx];

No Issues Found In:

  • Memory safety (proper Arc management)
  • JNI error propagation (well handled)
  • API consistency (matches existing patterns)
  • Security vulnerabilities (no unsafe operations)
  • Merge conflicts (clean)

Overall Assessment: Solid implementation with good architecture. The identified issues are edge cases that should be addressed for production robustness.

…check

1. merge_fruits: use checked_add to prevent usize overflow on very large
   result sets instead of unchecked sum
2. Streaming producer: distinguish early consumer drop (rows_emitted==0,
   returns error) from normal mid-stream drop (returns Ok)
3. Streaming producer: validate file_idx bounds before indexing into
   manifest.parquet_files to prevent panic on malformed input

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

claude bot commented Mar 13, 2026

Claude finished @schenksj's task




Code Review Complete

Reviewing PR #128 for:

  1. Correctness bugs (logic errors, off-by-one, null safety)
  2. Security issues (injection, unsafe operations, credential handling)
  3. Merge conflict markers or build-breaking issues
  4. Missing error propagation (especially JNI error swallowing)
  5. API consistency with existing patterns

Todo List

  • Review new Rust modules for memory safety and correctness
  • Check JNI error handling and propagation
  • Validate Java API changes for consistency
  • Look for potential security vulnerabilities
  • Check session management and resource cleanup
  • Review streaming pipeline implementation
  • Validate Arrow FFI buffer management

Summary: Found 1 Minor Issue

Good Architecture Design

  • Proper use of ARC_REGISTRY for session management (avoids use-after-free)
  • Sound streaming pipeline with bounded memory (~24MB max)
  • Appropriate separation of fused vs streaming paths
  • Good error handling in most areas
  • Integer overflow protection already implemented in docid_collector.rs with checked_add()
  • Proper bounds checking for file indices in streaming_ffi.rs:171-176

🟢 MINOR: Missing Bounds Check in Arrow-to-Tantivy Module

File: streaming_ffi.rs:197

let batches = read_parquet_batches_for_file(
    file_idx,
    &rows,
    projected_fields.as_deref().map(|s| s as &[String]),
    manifest,
    storage,
    metadata_cache,
    byte_cache,
    effective_coalesce,
).await?;

Issue: While bounds checking exists at line 171-176, the read_parquet_batches_for_file function receives file_idx that could potentially access arrays without additional validation in the callee.

Risk: Potential panic if the downstream function doesn't validate the file index.

Recommendation: Verify that read_parquet_batches_for_file includes its own bounds validation, or add a comment confirming that validation occurs at line 171-176.


No Issues Found In:

  • Memory safety (proper Arc management with ARC_REGISTRY)
  • JNI error propagation (comprehensive error handling with to_java_exception)
  • API consistency (matches existing patterns with AutoCloseable, synchronized methods)
  • Security vulnerabilities (no unsafe operations, proper credential handling)
  • Merge conflicts (clean)
  • Session lifecycle management (proper StreamingSession.close() implementation)
  • Integer overflow (already protected with checked_add())
  • Resource cleanup (proper native session cleanup via nativeCloseStreamingSession)

Overall Assessment: Excellent implementation with solid architecture and comprehensive safety measures. The single minor issue is a best practice recommendation rather than a functional bug.

…rite bug

Three changes in this commit:

1. Remove fused (non-streaming) companion retrieval path entirely.
   The deprecated searchAndRetrieveArrowFfi() now delegates to the
   streaming session internally. Deletes fused_retrieval.rs and its
   JNI function.

2. Add Arrow FFI streaming for non-companion (regular tantivy) splits.
   New streaming_doc_retrieval.rs converts tantivy doc store documents
   to Arrow RecordBatches via doc_async(), reusing the same
   StreamingRetrievalSession infrastructure. The JNI function now
   routes companion splits through parquet streaming and regular splits
   through tantivy doc store streaming.

3. Fix streaming query rewrite bug (TANTIVY4JAVA_STREAMING_FAST_FIELD_BUG).
   perform_bulk_search() was missing the companion query rewrites that
   the regular search() path applies: FieldPresence → _phash_* rewrite
   and exact_only term → hash lookup rewrite. Without these, IS NOT NULL
   returned 0 rows and exact_only EqualTo returned nothing. Also adds
   ensure_fast_fields_for_query() call for parquet-transcoded fast fields.

4. Add regression tests for streaming + field presence, IS NULL, and
   exact_only queries comparing streaming results against search() baseline.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
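The regression tests described in point 4 compare streaming output against the `search()` baseline. A minimal order-insensitive comparison can be sketched as below; the function name `results_match` and the `(doc_id, value)` row shape are hypothetical, chosen only to illustrate the check.

```rust
use std::collections::BTreeMap;

// Hypothetical sketch of the regression-test comparison: streaming results
// must match the search() baseline exactly, regardless of the order in
// which batches were emitted.
fn results_match(baseline: &[(u64, String)], streamed: &[(u64, String)]) -> bool {
    let to_map = |rows: &[(u64, String)]| -> BTreeMap<u64, String> {
        rows.iter().cloned().collect()
    };
    // Length check catches duplicated or dropped rows; map equality catches
    // wrong or reordered values keyed by document id.
    baseline.len() == streamed.len() && to_map(baseline) == to_map(streamed)
}

fn main() {
    let baseline = vec![(1, "a".to_string()), (2, "b".to_string())];
    let streamed = vec![(2, "b".to_string()), (1, "a".to_string())];
    println!("match: {}", results_match(&baseline, &streamed));
}
```

The value of baselining against `search()` is that it would have caught the rewrite bug directly: `search()` applied the companion rewrites and returned rows, while the streaming path did not and returned zero.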
@schenksj changed the title from "Add large result set retrieval (fused + streaming)" to "Large result set retrieval: streaming-only, all split types, query rewrite fix" on Mar 13, 2026

claude bot commented Mar 13, 2026

Claude finished @schenksj's task




Code Review Complete

Reviewing PR #128 for:

  1. Correctness bugs (logic errors, off-by-one, null safety)
  2. Security issues (injection, unsafe operations, credential handling)
  3. Merge conflict markers or build-breaking issues
  4. Missing error propagation (especially JNI error swallowing)
  5. API consistency with existing patterns

Todo List

  • Review repository guidelines (CLAUDE.md)
  • Review new Rust modules for memory safety and correctness
  • Check JNI error handling and propagation
  • Validate Java API changes for consistency
  • Look for potential security vulnerabilities
  • Check session management and resource cleanup
  • Review streaming pipeline implementation
  • Validate Arrow FFI buffer management

Summary: Found 1 Minor Issue

Good Architecture Design

  • Proper use of ARC_REGISTRY for session management (avoids use-after-free)
  • Sound streaming pipeline with bounded memory (~24MB max)
  • Appropriate separation of fused vs streaming paths
  • Good error handling in most areas
  • Integer overflow protection already implemented in docid_collector.rs with checked_add()
  • Proper bounds checking for file indices in streaming_ffi.rs:171-176

🟢 MINOR: Missing Bounds Check in Arrow-to-Tantivy Module

File: streaming_ffi.rs:197

let batches = read_parquet_batches_for_file(
    file_idx,
    &rows,
    projected_fields.as_deref().map(|s| s as &[String]),
    manifest,
    storage,
    metadata_cache,
    byte_cache,
    effective_coalesce,
).await?;

Issue: While bounds checking exists at line 171-176, the read_parquet_batches_for_file function receives file_idx that could potentially access arrays without additional validation in the callee.

Risk: Potential panic if the downstream function doesn't validate the file index.

Recommendation: Verify that read_parquet_batches_for_file includes its own bounds validation, or add a comment confirming that validation occurs at line 171-176.


No Issues Found In:

  • Memory safety (proper Arc management with ARC_REGISTRY)
  • JNI error propagation (comprehensive error handling with to_java_exception)
  • API consistency (matches existing patterns with AutoCloseable, synchronized methods)
  • Security vulnerabilities (no unsafe operations, proper credential handling)
  • Merge conflicts (clean)
  • Session lifecycle management (proper StreamingSession.close() implementation)
  • Integer overflow (already protected with checked_add())
  • Resource cleanup (proper native session cleanup via nativeCloseStreamingSession)

Overall Assessment: Excellent implementation with solid architecture and comprehensive safety measures. The single minor issue is a best practice recommendation rather than a functional bug.
